Remove deadlock when server is not acking our data #6107
Good start, but I think I see a few problems:
- We need to reset _timeout in the ::connect/::connectServer methods since the object can be reused.
- The delay() change worries me (BW on small and large packets) and does not seem to be necessary.
Thx
-EFP3
@sislakd, I've tested it and can say that delay(xxx) is not interrupted on a packet, so the delay() change needs to be undone.
Run the script and you'll see 10 s long high and low periods. While it's running, connect with nc/telnet/etc. to port 80 on the 8266. If delay() were interrupted on a packet, the 10 s period would be shortened when the connection came in. It is not: it is still 10 s until the LED flips. So, please undo the delay(timeout) change.
@earlephilhower Have you modified the WiFiServer::_accept method to include esp_schedule() in your test? A delay call is interrupted only if something calls esp_schedule(). The current version of WiFiServer doesn't include esp_schedule(), so you cannot see that behavior.
Reverting the delay in the _write_from_source method to esp_yield doesn't solve the deadlock when no ack is coming. What will work is a plain yield() or optimistic_yield, which causes quick cycling in _write_from_source and lets the timeout return from the method if the remote peer is not responding.
@sislakd, I think maybe I wasn't clear. I was demonstrating that when you put in a That's the current behavior and people do depend on it. delay(100) is around 100ms, and not sometimes 1ms. SWSerial, others, etc. I'm not sure what you meant about changing WiFiClient::accept, but if you're talking about something which will globally cause So, what I'm saying is that delay(timeout) can only worsen TCP response since it will go idle no matter what even if an ack or whatever comes in, and that changing delay()'s semantics so that it sometimes returns as a random variable (since the TCP stack is async and anyone can send at any time) isn't going to work.
It seems that you don't understand what I'm trying to explain. I will try to summarize my thinking in detail, and hopefully it will be much clearer than before.
Now back to your simple test application:
Now back to this PR:
Why this PR is proposed then:
Let me know whether this explanation makes sense to you and whether you understand why this PR is important for fixing the ClientContext implementation. Sorry for the long comment, but I don't know how much of the other code you know in detail.
@sislakd, thanks for your detailed explanation. The LWIP and packet interface is one place I've thankfully not had to dive into, and I think I understand where you're going here now. I was talking with @devyte about this as well, and I can see that the pattern of "delay(timeout)" with an "esp_schedule()" to short-circuit it is used in the ::connect method. Your last bit is clear and seems fine now that I get it.
I appreciate your recent PRs, but I do hope you understand that I'm only trying to be extremely careful with such a critical bit of the code and not trying to discourage you!
To @d-a-v for final signoff.
After @earlephilhower's question, your explanation is crystal clear. You are mastering the nonos model of this core, as well as the network management internals. As a side note, I'm a bit concerned about how _timeout is managed. It is defined by Stream:: (1000 by default), possibly modified by the user, overwritten during the SSL handshake (to 15000, with good reasons), and now, per this PR, restored to 5000. I'll let @earlephilhower think about it since this is very specific to the BearSSL client.
After the handshake, things go much faster. AES isn't nearly as expensive.
I've added a comment to the introduced delay so that it is understandable when reading the code.
Changes since 2.5.1 (to 2.5.2)

Core
----
* Add explicit Print::write(char) (esp8266#6101)

Build system
----
* Fix typo in elf2bin for QOUT binary generation (esp8266#6116)
* Support PIO Wl-T and Arduino -T linking properly (esp8266#6095)
* Allow *.cc files to be linked into flash by default (esp8266#6100)
* Use custom "ElfToBin" builder for PIO (esp8266#6091)
* Fail if generated JSON file cannot be read (esp8266#6076)
* Moved 'Dropping' print from stdout to stderr in drop_versions.py (esp8266#6071)
* Fix PIO issue when build environment contains spaces (esp8266#6119)

Libraries
----
* Remove deadlock when server is not acking our data (esp8266#6107)
* Bugfix for stuck in write method of WiFiClient and WiFiClientSecure until the remote peer closed connection (esp8266#6104)
* Re-add original SD FAT info access methods (esp8266#6092)
* Make FILE_WRITE append in SD.h wrapper (esp8266#6106)
* Drop X509 after connection, avoid hang on TLS broken (esp8266#6065)
esp_yield() now also calls esp_schedule(), original esp_yield() function renamed to esp_suspend().

Don't use delay(0) in the Core internals, libraries and examples. Use yield() when the code is supposed to be called from CONT, use esp_yield() when the code can be called from either CONT or SYS.

Clean-up esp_yield() and esp_schedule() declarations across the code and use coredecls.h instead.

Implement helper functions for libraries that were previously using esp_yield(), esp_schedule() and esp_delay() directly to wait for certain SYS context tasks to complete.

Correctly use esp_delay() for timeouts, make sure scheduled functions have a chance to run (e.g. LwIP_Ethernet uses recurrent)

Related issues:
- #6107 - discussion about the esp_yield() and esp_delay() usage in ClientContext
- #6212 - discussion about replacing delay() with a blocking loop
- #6680 - pull request introducing LwIP-based Ethernet
- #7146 - discussion that originated UART code changes
- #7969 - proposal to remove delay(0) from the example code
- #8291 - discussion related to the run_scheduled_recurrent_functions() usage in LwIP Ethernet
- #8317 - yieldUntil() implementation, similar to the esp_delay() overload with a timeout and a 0 interval
If the server that a WiFiClient or WiFiClientSecure is connected to stops acking our outgoing TCP packets (network issue, server issue), a write attempt can get stuck forever (if nothing else generates an esp_schedule call).
This PR proposes replacing esp_yield in ClientContext::_write_from_source with a delay, which returns either when it is interrupted by esp_schedule or when the given timeout expires. On the next loop iteration _is_timeout() matches and the write attempt ends.
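To make the mechanism concrete, here is a minimal host-side sketch of such a loop. It is not the actual ClientContext code: SimPeer, SimClock, bounded_wait and write_once are hypothetical stand-ins, the clock is simulated, and the wait length of timeout + 1 ms is an assumption chosen so that the next timeout check matches.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical stand-ins for illustration only; none of these names exist in the core.
struct SimPeer {
    bool acks;           // false models a server that never acks
    uint32_t ack_at_ms;  // simulated time at which the ack arrives
};

struct SimClock {
    uint32_t now_ms;
};

struct WriteOutcome {
    bool sent;
    uint32_t elapsed_ms;
};

// Models a delay(max_wait_ms) that an ack-driven wake-up can cut short
// (the role esp_schedule() plays in the discussion). Returns true if woken early.
bool bounded_wait(SimClock& clock, const SimPeer& peer, uint32_t max_wait_ms) {
    assert(max_wait_ms > 0);
    const uint32_t deadline = clock.now_ms + max_wait_ms;
    if (peer.acks && peer.ack_at_ms <= deadline) {
        if (peer.ack_at_ms > clock.now_ms) {
            clock.now_ms = peer.ack_at_ms;  // woken by the ack
        }
        return true;
    }
    clock.now_ms = deadline;  // nobody woke us: the wait itself expires
    return false;
}

// Sketch of the loop shape: check the timeout, try to make progress, else wait.
WriteOutcome write_once(SimClock& clock, const SimPeer& peer, uint32_t timeout_ms) {
    const uint32_t start = clock.now_ms;
    for (;;) {
        if (clock.now_ms - start > timeout_ms) {
            WriteOutcome timed_out = {false, clock.now_ms - start};
            return timed_out;
        }
        if (peer.acks && clock.now_ms >= peer.ack_at_ms) {
            WriteOutcome ok = {true, clock.now_ms - start};
            return ok;
        }
        // An unbounded wait here would never return for a silent peer.
        bounded_wait(clock, peer, timeout_ms + 1);
    }
}
```

With a peer that never acks and a 5000 ms timeout, write_once gives up after one bounded wait; with a peer that acks at 120 ms, it returns at 120 ms because the wait is cut short.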
For WiFiClientSecure, it took 2 x 15 seconds to return from the WiFiClientSecure::_write method when the remote peer was not acking our outgoing TCP packets and the send buffer was already full: one _run_until is called from flush, and a second _run_until is called in the loop, after which the code leaves the _write method because of -1. It would be good to have a default timeout consistent with WiFiClient, so this PR proposes reducing the timeout from 15 seconds to 5 seconds in WiFiClientSecure after a successful handshake with the remote peer.
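The arithmetic behind the 2 x 15 seconds figure can be sketched with a small model. SecureClientModel and its members are hypothetical and only mirror the values quoted in this thread (1000 ms Stream default, 15000 ms during the handshake, 5000 ms afterwards); the factor of two assumes both _run_until calls wait for the full timeout.

```cpp
#include <stdint.h>

// Values taken from the discussion; the names are illustrative assumptions.
constexpr uint32_t kStreamDefaultTimeoutMs = 1000;
constexpr uint32_t kHandshakeTimeoutMs = 15000;
constexpr uint32_t kPostHandshakeTimeoutMs = 5000;

// Hypothetical model of how the timeout changes over a connection's lifetime.
class SecureClientModel {
public:
    uint32_t timeout_ms() const { return timeout_ms_; }

    // Sets the timeout on every connect, so a reused object starts from a known value.
    void connect(bool reduce_after_handshake) {
        timeout_ms_ = kHandshakeTimeoutMs;  // the handshake needs the long timeout
        // ... the handshake would happen here ...
        if (reduce_after_handshake) {
            timeout_ms_ = kPostHandshakeTimeoutMs;  // behavior proposed by the PR
        }
    }

    // Two sequential waits (flush, then the write loop), each up to one timeout.
    uint32_t worst_case_write_stall_ms() const { return 2 * timeout_ms_; }

private:
    uint32_t timeout_ms_ = kStreamDefaultTimeoutMs;
};
```

In this model, keeping the handshake timeout gives a worst-case stall of 30000 ms, while reducing it after the handshake gives 10000 ms.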